Data science is concerned with extracting knowledge from data. This goal gives rise to a sophisticated set of interdisciplinary tasks that demand different skills in data exploration.
This report illustrates how such data science tasks can be implemented in code, using a simulated business project as an example. For this purpose, data from an online shop was provided. During the project, we implemented the import, cleaning and manipulation of the data. We also covered exploratory analysis, visualization, experiment analysis and machine learning (construction of a model); the results are described in the following sections of this report. Each task was implemented in either R or Python. The first step, the data cleaning, was implemented in both languages.
The project’s main task is to analyze the given datasets, in particular by creating appropriate visualizations and summary tables for the data that is suspected to be relevant to the online shop. In addition, an experiment analysis has to be carried out that measures and presents the performance of different recommendation systems. Furthermore, a model should be constructed to predict whether a user of the online shop will place an order. This prediction should be based on the person's browsing behaviour on the website.
The provided data comes from an online shop selling beauty products. There are two datasets: one with data about customer orders and another with data about customer clicks on the website.
The orders dataset contains one row per customer order for the period from 28 January 2000 to 3 March 2000. The data contains values characterizing the ordered products, the payment methods used, the order process itself and social data, for example the customer’s location or age.
The clicks dataset, in contrast, contains data on customer clicks on the company's website from 14 April 2000 to 30 April 2000. It mainly consists of information about payment methods, customer attributes and preferences, and product details, as well as values describing each click itself, for example the request time or the current page URL. With this data it is possible to reconstruct the whole course of a customer's session.
Both datasets share a considerable number of columns. However, since not every click results in an order and a session normally consists of more than one click, their contents differ significantly.
Before we started cleaning the data, we copied it to a separate folder. This keeps the edited datasets apart from the original ones and avoids accidentally altering the originals. The cleaned data and further derived forms of the datasets were also saved to this directory. For an overview of the folder structure see the Appendix.
The cleaning process consists of the following steps:
1. Copy the original datasets
2. Read the data from the files
3. Add headers
4. Replace “?” and “NULL” with NA
5. Drop columns with a 100% ratio of missing data
6. Reformat datetime cells
7. Save the result
8. Save a subset of the cleaned data, containing only 1000 rows, to support a quick view into the cleaned data
A similar cleaning process to the one explained above has been implemented in Python and R.
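The core of the cleaning steps can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the project: the function name `clean_table`, the column names and the sample values are made up for the example, and the datetime reformatting step is left out because the source format is not stated here.

```python
import csv
import io

# Missing-value markers named in the cleaning steps.
MISSING_MARKERS = {"?", "NULL"}


def clean_table(raw_text, header):
    """Parse headerless CSV text, map missing markers to None and
    drop columns that contain no data at all."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    # Replace the missing-value markers with None (the NA equivalent here).
    rows = [
        [None if cell in MISSING_MARKERS else cell for cell in row]
        for row in rows
    ]
    # Keep only columns that have at least one non-missing value.
    keep = [
        i for i in range(len(header))
        if any(row[i] is not None for row in rows)
    ]
    cleaned_header = [header[i] for i in keep]
    cleaned_rows = [[row[i] for i in keep] for row in rows]
    return cleaned_header, cleaned_rows


# Tiny made-up sample, not taken from the real datasets.
sample = "1,?,NULL,2000-01-28\n2,NY,NULL,2000-01-29\n"
header, rows = clean_table(sample, ["id", "state", "unused", "date"])
print(header)
print(rows)
```

In this sample the column `unused` is dropped because every cell in it is missing, while `state` is kept because one of its two cells is filled.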
Note: Python code chunks are generally executable in RMarkdown, but when chunks are run individually in preview mode the Python environment does not persist across different Python chunks. When the document is knitted, however, the chunks are compiled together.
To execute the provided code you may need to set your Python path in an R code block using “use_python()”. If this does not work, copy the Python code into the Python IDE of your choice and run it, ideally from the main repository directory “./”.
In addition to having the packages installed, the software “graphviz” needs to be installed. If any bugs related to graphviz prevent you from running the code, you can use the decision tree figure provided in this report as a reference.
To test whether the cleaning scripts in Python and R produce the same file, we implemented a short script that creates a diff view. The result showed no differences between the cleaned versions of the datasets created with the two languages.
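Such a comparison could look roughly like the following sketch, which uses Python's standard `difflib` module. The function name `diff_lines` and the file labels are hypothetical, and the lines are passed in as lists here so that the example runs without the actual files.

```python
import difflib


def diff_lines(lines_a, lines_b,
               name_a="cleaned_python.csv", name_b="cleaned_r.csv"):
    """Return a unified diff of two line lists; an empty list means
    that both versions are identical."""
    return list(difflib.unified_diff(
        lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""))


# Made-up content standing in for the two cleaned files.
python_version = ["id,state", "1,NA", "2,NY"]
r_version = ["id,state", "1,NA", "2,NY"]

differences = diff_lines(python_version, r_version)
print("identical" if not differences else "\n".join(differences))
```

For real files, the line lists would be obtained by reading each file and splitting it into lines before calling the function.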
Note: As we explain in this subchapter, the merged data is too small to be useful. This is why we did not implement the merging in R after coding it in Python.
We tried to merge the click and order data in Python using different combinations of ID columns that occur in both datasets. For these tests we used an inner join, because it makes it easy to see whether a merge attempt succeeded: an unsuccessful attempt yields zero rows. We tried the following combinations, which resulted in the shown shapes (rows, columns) for the merged dataset:
| Clicks | Orders | Shape |
|---|---|---|
| Session ID | Order Line Session ID | [0, 438] |
| Session ID | Order Session ID | [0, 438] |
| Customer ID | Customer ID | [6906, 437] |
| Session Cookie ID | Order Line Session ID | [0, 438] |
| Session Cookie ID | Order Session ID | [0, 438] |
In this way we discovered that the datasets can be joined on the ‘Customer ID’ for some instances. Thus, we saved a dataset with the merge results on the Customer ID. The merged data is of limited use, though, since a customer ID can have multiple order and click rows, and the join pairs every order row of a customer with every click row of that customer. Another problem is that the time periods of the two datasets do not overlap at all. Because of these issues, we decided to build a second, smaller data subset containing only the customer information columns of both original datasets. The final merged customer dataset contains 80 attributes for 97 customers. Since only such a small share of the data could be merged, we consider the joined dataset rather unimportant and did not perform any further analytical steps based on it.
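The effect of an inner join on different key pairs can be illustrated with a small standard-library sketch. The helper `inner_join` and the toy rows are assumptions made for this illustration; the project itself most likely used a data frame library, and the real column values are not reproduced here.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, left_key, right_key):
    """Inner join of two lists of dicts; rows with a missing key never match."""
    index = defaultdict(list)
    for right in right_rows:
        if right.get(right_key) is not None:
            index[right[right_key]].append(right)
    joined = []
    for left in left_rows:
        key = left.get(left_key)
        if key is None:
            continue
        # Every matching right row is paired with the current left row.
        for right in index.get(key, []):
            joined.append({**left, **right})
    return joined


# Toy data: customer 1 has two click rows and three order rows.
clicks = [
    {"Session ID": "s1", "Customer ID": 1, "Request Sequence": 1},
    {"Session ID": "s1", "Customer ID": 1, "Request Sequence": 2},
    {"Session ID": "s2", "Customer ID": 2, "Request Sequence": 1},
]
orders = [
    {"Order Session ID": "o1", "Customer ID": 1, "Order Line ID": 10},
    {"Order Session ID": "o1", "Customer ID": 1, "Order Line ID": 11},
    {"Order Session ID": "o2", "Customer ID": 1, "Order Line ID": 12},
    {"Order Session ID": "o3", "Customer ID": 3, "Order Line ID": 13},
]

by_session = inner_join(clicks, orders, "Session ID", "Order Session ID")
by_customer = inner_join(clicks, orders, "Customer ID", "Customer ID")
print(len(by_session))   # no shared session IDs in this toy data
print(len(by_customer))  # 2 click rows x 3 order rows for customer 1
```

The second join shows why a row count such as 6906 does not correspond to 6906 customers: each customer contributes the product of their click rows and order rows.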
The aim of the data analysis is to extract information, which is suspected to be valuable to the online shop, and prepare it in a way that makes it easily “digestible”. The overview of the information is presented in summary tables and different kinds of visualizations.
Before creating overview tables or plots for columns, it makes sense to evaluate which columns actually contain a large amount of information and which do not. To check the ratio of filled cells, we created a ranking for each dataset containing the column names and the percentage of missing data for each column. Columns with a low percentage of missing data are then preferred in later analysis steps. To give an impression of the results, the first 20 entries of the two resulting rankings are shown below.
| Order Column | Order NA % | Click Column | Click NA % |
|---|---|---|---|
| Order Line Date | 0 | Request Processing Time | 0.000 |
| Order Line Date_Time | 0 | Request Date | 0.000 |
| Order Line Unit List Price | 0 | Request Date_Time | 0.000 |
| Order Line ID | 0 | Request Sequence | 0.000 |
| Order Line Quantity | 0 | Request Template | 0.000 |
| Order Line Unit Sale Price | 0 | REQUEST_DAY_OF_WEEK | 0.000 |
| Order Line Status | 0 | REQUEST_HOUR_OF_DAY | 0.000 |
| Order Line Tax Amount | 0 | Cookie First Visit Date | 0.000 |
| Order Line Amount | 0 | Cookie First Visit Date_Time | 0.000 |
| Order Line Day of Week | 0 | Session First Request Date | 0.000 |
| Order Line Hour of Day | 0 | Session First Request Date_Time | 0.000 |
| City | 0 | Session Cookie ID | 0.000 |
| US State | 0 | Session ID | 0.000 |
| Account Creation Date | 0 | Session User Agent | 0.000 |
| Account Creation Date_Time | 0 | Session Visit Count | 0.000 |
| Account Status | 0 | Session First Processing Time | 0.000 |
| Customer ID | 0 | Session First Template | 0.000 |
| Order Date | 0 | Session First Request Day of Week | 0.000 |
| Order Date_Time | 0 | Session First Request Hour of Day | 0.000 |
| Order Customer ID | 0 | Session First Content ID | 0.001 |
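A ranking of this kind can be computed as in the following sketch. It assumes that missing cells are represented as `None`; the function name `missing_ratio_ranking` and the sample rows are hypothetical.

```python
def missing_ratio_ranking(header, rows):
    """Return (column name, percentage of missing cells) pairs,
    sorted from the most complete column to the least complete one."""
    total = len(rows)
    ranking = []
    for position, name in enumerate(header):
        missing = sum(1 for row in rows if row[position] is None)
        ranking.append((name, round(100 * missing / total, 3)))
    # sorted() is stable, so columns with equal ratios keep their file order.
    return sorted(ranking, key=lambda item: item[1])


# Made-up rows; None marks a missing cell.
header = ["Session ID", "Gender", "City"]
rows = [
    ["s1", None, "Austin"],
    ["s2", "Female", None],
    ["s3", None, None],
    ["s4", None, "Dallas"],
]
ranking = missing_ratio_ranking(header, rows)
for name, percent in ranking:
    print(f"{name}: {percent}%")
```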
The order data can be mainly divided into 4 sections:
The clickstream data has three main categories: customer data, product data and time data. The clickstream dataset contains payment methods as well, but the information from these attributes is not considered here, since the payment data is only available for customers who actually ordered. For analysis purposes only the most important or interesting attributes are discussed in detail. For the full range of columns see the Appendix.
Given a subset of interesting columns, we create two types of summary tables: one for the numerical columns in the subset and another for the factors. The summary table for the numerical data contains the maximum, minimum, mean, median and standard deviation for each column. The factor tables contain the five most frequent levels with their percentages, the combined percentage of all other levels and the percentage of NAs for each column. An important aspect of the factor tables is that the NA percentage is calculated first; the NA values are then removed from the column, and the percentage of each factor level is calculated relative to the remaining values.
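The two kinds of summary can be sketched as follows, using only the Python standard library. The function names and the sample values are made up, and the use of the sample standard deviation is an assumption, since the report does not state which variant was used.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Max, mean, median, min and sample standard deviation, ignoring None."""
    present = [v for v in values if v is not None]
    return {
        "Max": max(present),
        "Mean": statistics.mean(present),
        "Median": statistics.median(present),
        "Min": min(present),
        "SD": statistics.stdev(present),  # assumption: sample SD (n - 1)
    }


def factor_summary(values, top_n=5):
    """NA percentage over all rows first, then level percentages
    over the non-missing rows only."""
    na_percent = 100 * sum(v is None for v in values) / len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present).most_common(top_n)
    top = [(level, 100 * n / len(present)) for level, n in counts]
    others_count = len(present) - sum(n for _, n in counts)
    return {
        "Top": top,
        "Others": 100 * others_count / len(present),
        "Not.Available": na_percent,
    }


# Made-up sample columns.
ages = [18, 36, 54, None]
genders = ["Female", "Female", "Male", None]
print(numeric_summary(ages))
print(factor_summary(genders))
```

Because the level percentages are computed after removing the missing values, they add up to 100% together with "Others", independently of the NA percentage. This is why a row can show both "Female: 83.06%" and a high share of missing values.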
Note: To support a better visualization, the most relevant columns are highlighted in black.
In the following section the summary tables generated to describe the order data are shown. Additionally, the most important or interesting analysis results are highlighted and briefly explained. It is important to note that the analysis is based on each order. That means a customer's data can be counted multiple times in the customer analysis if he or she ordered more than once in the shop.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Age | 98 | 38.37 | 36 | 18 | 10.87 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| City | New York: 4.53% | San Francisco: 2.05% | Stamford: 1.24% | Austin: 1.13% | Brooklyn: 0.98% | 90.07% | 0% |
| Country | United States: 100% | | | | | 0% | 2.83% |
| US State | CA: 14.63% | NY: 14.11% | TX: 6.93% | PA: 5.8% | CT: 5.28% | 53.25% | 0% |
| Marital Status | Married: 66.13% | Single: 22.02% | Inferred Single: 7.15% | Inferred Married: 4.7% | | 0% | 34.98% |
| Gender | Female: 83.06% | Male: 16.94% | | | | 0% | 44.96% |
| Audience | Women: 81.17% | Men: 12.5% | Children: 6.33% | | | 0% | 11.08% |
| Truck Owner | False: 78.55% | True: 21.45% | | | | 0% | 22.22% |
| RV Owner | False: 91.5% | True: 8.5% | | | | 0% | 22.22% |
| Motorcycle Owner | False: 98.66% | True: 1.34% | | | | 0% | 22.22% |
| Working Woman | False: 68.79% | True: 31.21% | | | | 0% | 22.22% |
| Presence Of Children | False: 54.66% | True: 45.34% | | | | 0% | 22.22% |
| Speciality Store Retail | False: 84.12% | True: 15.88% | | | | 0% | 22.22% |
| Oil Retail Activity | False: 91.8% | True: 8.2% | | | | 0% | 22.22% |
| Bank Retail Activity | False: 75.44% | True: 24.56% | | | | 0% | 22.22% |
| Finance Retail Activity | False: 91.69% | True: 8.31% | | | | 0% | 22.22% |
| Miscellaneous Retail Activity | False: 94.88% | True: 5.12% | | | | 0% | 22.22% |
| Upscale Retail | False: 94.25% | True: 5.75% | | | | 0% | 22.22% |
| Upscale Speciality Retail | False: 96.44% | True: 3.56% | | | | 0% | 22.22% |
| Retail Activity | False: 60% | True: 40% | | | | 0% | 22.22% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| StockType | Replenishable: 69.58% | Seasonal 1: 23.08% | Replenishment: 5.14% | Seasonal 1*: 1.83% | Seasonal 2: 0.24% | 0.13% | 14.72% |
| Manufacturer | American Essentials: 20.91% | Ridgeview: 16.64% | HAN: 13.16% | Donna Karan Company: 10.85% | HOSO: 10.67% | 27.77% | 6.15% |
| BrandName | AME: 22.07% | HOSO: 11.26% | ELT: 10.81% | Silk Reflections: 9.35% | DAN: 7.92% | 38.59% | 11.08% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Order Credit Card Brand | VISA: 59.94% | MC: 25.43% | AMEX: 14.31% | DISC: 0.31% | NA | 0.01% | 16.71% |
| Bank Card Holder | True: 86.57% | False: 13.43% | NA | | | 0% | 22.22% |
| Gas Card Holder | True: 75.81% | False: 24.19% | NA | | | 0% | 22.22% |
| Upscale Card Holder | True: 54.1% | False: 45.9% | NA | | | 0% | 22.22% |
| Unknown Card Type | False: 56.18% | True: 43.82% | NA | | | 0% | 22.22% |
| TE Card Holder | False: 89.42% | True: 10.58% | NA | | | 0% | 22.22% |
| Premium Card Holder | False: 75.88% | True: 24.12% | NA | | | 0% | 22.22% |
| New Bank Card | False: 99.55% | True: 0.45% | NA | | | 0% | 22.22% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Order Line Quantity | 18 | 1.31 | 1.0 | -2 | 0.95 |
| Order Line Unit List Price | 72 | 9.26 | 7.5 | 0 | 6.46 |
| Order Line Amount | 234 | 11.62 | 10.0 | -40 | 11.51 |
| Order Line Hour of Day | 23 | 13.04 | 13.0 | 0 | 5.29 |
| Order Discount Amount | 50 | 8.82 | 10.0 | 0 | 9.98 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Spend Over 12 Per Order On Average | False: 64.04% | True: 35.96% | | | | 0% | 0% |
| Order Line Day of Week | Wednesday: 26.96% | Tuesday: 17.72% | Thursday: 17.37% | Friday: 16.48% | Saturday: 8.43% | 13.04% | 0% |
| Order Promotion Code | FRIEND: 82.09% | SPRING: 2.14% | MARCH1: 1.92% | FREE: 1.39% | 4128003160593466: 1.13% | 11.33% | 23.15% |
The following tables display only the most important columns from the clickstream dataset. Interesting details are discussed in short texts. For the full range of columns see the Appendix.
Since the clickstream dataset contains a row for every click of a customer, the social information for a user gets easily multiplied by the browsing behaviour. Thus, the data has a high probability of being skewed when it comes to customer information. Therefore, we calculated the customer summary based on only one row per session.
The mean age is 37.58 and the median is 36, implying that the core customer base is in its late 30s.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Age | 86 | 37.58 | 36 | 18 | 10.71 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| City | San Francisco: 2.31% | New York: 2.18% | Chicago: 1.35% | Stamford: 1.09% | Dallas: 0.71% | 92.36% | 96.92% |
| US State | CA: 13.28% | NY: 11.1% | TX: 5.97% | PA: 5.39% | IL: 4.68% | 59.58% | 96.92% |
| Marital Status | Married: 61.8% | Single: 24.7% | Inferred Married: 7% | Inferred Single: 6.5% | | 0% | 98.02% |
| Gender | Female: 83.33% | Male: 16.67% | | | | 0% | 98.43% |
| Audience | Women: 85.44% | Children: 10.27% | Men: 4.29% | | | 0% | 98.25% |
| Truck Owner | False: 77.84% | True: 22.16% | | | | 0% | 97.72% |
| RV Owner | False: 91.17% | True: 8.83% | | | | 0% | 97.72% |
| Motorcycle Owner | False: 98.79% | True: 1.21% | | | | 0% | 97.72% |
| Working Woman | False: 65.54% | True: 34.46% | | | | 0% | 97.72% |
| Presence Of Children | False: 50.39% | True: 49.61% | | | | 0% | 97.72% |
| Speciality Store Retail | False: 84.59% | True: 15.41% | | | | 0% | 97.72% |
| Oil Retail Activity | False: 90.65% | True: 9.35% | | | | 0% | 97.72% |
| Bank Retail Activity | False: 77.58% | True: 22.42% | | | | 0% | 97.72% |
| Finance Retail Activity | False: 90.3% | True: 9.7% | | | | 0% | 97.72% |
| Miscellaneous Retail Activity | False: 94.37% | True: 5.63% | | | | 0% | 97.72% |
| Upscale Retail | False: 94.37% | True: 5.63% | | | | 0% | 97.72% |
| Upscale Speciality Retail | False: 96.36% | True: 3.64% | | | | 0% | 97.72% |
| Retail Activity | False: 64.33% | True: 35.67% | | | | 0% | 97.72% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| StockType | Replenishable: 60.93% | Seasonal 1: 23.75% | Seasonal 1*: 11.25% | Seasonal 2: 2.07% | Replenishment: 1.83% | 0.17% | 80.3% |
| Manufacturer | Donna Karan Company: 10.79% | Peneco: 9.08% | HAN: 8.66% | Kneipp: 6.61% | Paul Lavitt Mills Inc.: 6.54% | 58.32% | 80.12% |
| BrandName | DKNY: 9.99% | Silk Reflections: 9.21% | ORO: 8.85% | HPK: 7.22% | AME: 7.13% | 57.6% | 86.16% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Request Date | 2000-04-27: 10.35% | 2000-04-28: 7.92% | 2000-04-17: 7.43% | 2000-04-19: 7.38% | 2000-04-15: 7.25% | 59.67% | 0% |
| REQUEST_DAY_OF_WEEK | Saturday: 16.66% | Thursday: 15.4% | Sunday: 14.97% | Tuesday: 13.56% | Wednesday: 13.32% | 26.09% | 0% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| REQUEST_HOUR_OF_DAY | 23 | 11.30 | 11 | 0 | 6.2 |
| Session Visit Count | 974 | 13.02 | 1 | 1 | 73.8 |
| Request Date | 2000-04-30 | NA | NA | 2000-04-14 | NA |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Session Duration | 1439.983 | 8.88 | 2.55 | 0 | 74.34 |
| Click Number | 1842.000 | 46.93 | 4.00 | 1 | 189.84 |
When comparing the product and customer information for the order and click dataset directly, some interesting observations can be made.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Orders | 98 | 38.37 | 36 | 18 | 10.87 |
| Clicks | 86 | 37.58 | 36 | 18 | 10.71 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Orders | New York: 4.53% | San Francisco: 2.05% | Stamford: 1.24% | Austin: 1.13% | Brooklyn: 0.98% | 90.07% | 0% |
| Clicks | San Francisco: 2.31% | New York: 2.18% | Chicago: 1.35% | Stamford: 1.09% | Dallas: 0.71% | 92.36% | 96.92% |
| Variable | Top.1st | Top.2nd | Top.3rd | Others | Not.Available |
|---|---|---|---|---|---|
| Orders | Women: 81.17% | Men: 12.5% | Children: 6.33% | 0% | 11.08% |
| Clicks | Women: 85.44% | Children: 10.27% | Men: 4.29% | 0% | 98.25% |
| Variable | Top.1st | Top.2nd | Others | Not.Available |
|---|---|---|---|---|
| Orders | False: 54.66% | True: 45.34% | 0% | 22.22% |
| Clicks | False: 50.39% | True: 49.61% | 0% | 97.72% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Orders | Replenishable: 69.58% | Seasonal 1: 23.08% | Replenishment: 5.14% | Seasonal 1*: 1.83% | Seasonal 2: 0.24% | 0.13% | 14.72% |
| Clicks | Replenishable: 60.93% | Seasonal 1: 23.75% | Seasonal 1*: 11.25% | Seasonal 2: 2.07% | Replenishment: 1.83% | 0.17% | 80.3% |
Manufacturer and Brand: In the click data, Donna Karan leads both the manufacturer and the brand ranking, but is closely followed by its competitors. Also surprising is how weakly American Essentials is represented in the click dataset, given that it leads both the manufacturer and the brand ranking for the order data. This indicates that American Essentials has a very high ratio of orders to product views.
Manufacturer
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Orders | American Essentials: 20.91% | Ridgeview: 16.64% | HAN: 13.16% | Donna Karan Company: 10.85% | HOSO: 10.67% | 27.77% | 6.15% |
| Clicks | Donna Karan Company: 10.79% | Peneco: 9.08% | HAN: 8.66% | Kneipp: 6.61% | Paul Lavitt Mills Inc.: 6.54% | 58.32% | 80.12% |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Orders | AME: 22.07% | HOSO: 11.26% | ELT: 10.81% | Silk Reflections: 9.35% | DAN: 7.92% | 38.59% | 11.08% |
| Clicks | DKNY: 9.99% | Silk Reflections: 9.21% | ORO: 8.85% | HPK: 7.22% | AME: 7.13% | 57.6% | 86.16% |
Some information is too complex to be compressed into a single table without making it confusing, or it is simply easier to understand when presented as a plot. The plot types used are time series plots, stacked bar plots, distribution curves, Lorenz curves and maps.
The customer data in the order dataset can be viewed from two perspectives. One is to use every single order row to create the visualizations and thereby obtain a weighted view of the data, in which customers who have bought more products carry more weight. The other is to display the customer information once per unique customer and disregard the number of orders a customer made. The following plots show both perspectives.
First, we generated density curves for the attribute age to visualize the age distribution of the customer base, and we also differentiated between genders. All curves show a fast rise in the number of customers between the ages of 20 and 40, followed by a slow decrease after a peak at about 35 to 40. When comparing the weighted and the unweighted graph, a slight shift of the curve can be observed. This indicates that older people generally tend to order more. This is especially relevant for older male customers, as shown by the peak at about 55 in the male curve of the weighted density plot. Additionally, a narrower peak can be observed for the male customers, with its maximum shortly before the age of 40. Furthermore, it should be kept in mind that the gender plot shows a separate percentage curve for each gender, while the number of customers differs by gender.
To show the customer locations, we generated a map of the United States that shows the cities with the highest customer numbers. From this it can be observed that on average the west coast of the US orders the most. The difference between the weighted and the unweighted customer graph shows that customers from large cities like New York or San Francisco tend to place many orders, since the large circles at these cities disappear when we look at the unweighted graph.
To visualize the importance of different areas, we created a heatmap for the different US States. Here we can see that California and New York are the most important customer states. This is probably highly influenced by the big cities in these states. Here, the distorting factor of population density must be taken into account.
For the product data, it has to be kept in mind that we visualize the orders, so the characteristics of more frequently bought products have a higher influence. The following plots should therefore be read as popularity graphs of different product attributes.
The following stacked bar plot shows the stock types for the top brands and manufacturers. Most of them have a replenishable assortment. The biggest brands and manufacturers seem to have mixed stock types, which partly contain seasonal products.
Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details see the top attributes for products in the summary tables.
The Lorenz curve of the products shows that only a quarter of all products accounts for about 75% of the ordered items. This indicates that some products are very popular. The curve is fairly smooth, which indicates that product popularity decreases slowly and steadily over the ranking.
The manufacturer Lorenz curve shows a high share of orders for the biggest manufacturer, which suggests that popularity is even more concentrated among manufacturers than among products. Furthermore, the graph has a slight kink, showing that the nine most popular manufacturers account for the majority of products sold in the online shop.
The brand Lorenz curve looks a little flatter, indicating that the manufacturer is more relevant than the brand. The curve has kinks in two places. The first can be explained by the high share of American Essentials in the orders (as we know from the summary tables). Another interesting anomaly can be observed after the top 11 brands, because the share of the following brands decreases strongly after this point.
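The points of such a curve can be computed as in the following sketch. It assumes that the items are ranked from most to least popular, which matches the descriptions above (a curve rising quickly towards the upper left); a classical Lorenz curve sorts in ascending order instead. The function name `cumulative_share` and the order counts are made up.

```python
def cumulative_share(counts):
    """Return (share of items, cumulative share of orders) points,
    with items ranked from most to least popular."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    points = []
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        points.append((rank / len(ordered), running / total))
    return points


# Made-up order counts for four products.
order_counts = [1, 6, 2, 1]
curve = cumulative_share(order_counts)
for item_share, order_share in curve:
    print(f"top {item_share:.0%} of products -> {order_share:.0%} of orders")
```

In this toy example the most popular quarter of the products already accounts for 60% of the orders; the flatter the resulting curve, the more evenly the orders are spread over the items.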
The stacked bar plot for credit card brands shows the share of premium and of upscale cards for each brand. The card brand AMEX stands out due to a relatively high share of premium and upscale cards.
For the order process data, some graphs on the order amount and price were created. It has to be mentioned again that basing the graphs on order rows can distort them; for example, it is conceivable that customers tend to buy the cheaper products, which makes the average product price seem lower.
The density curve for the discount amount shows three peaks: the first has a medium height and lies around a discount of 0%, the second lies around the 10% mark and is quite large, and the last lies at a discount of 50% and is rather low. This might indicate that customers buy in a targeted way rather than at random.
The next plot shows a density curve of the order time. As expected, the number of orders goes down during the night. During the day, ordering is relatively stable, with some small peaks at 10 a.m. and in the afternoon. The activity in the afternoon can be explained by typical working hours, which mostly leave people time for online shopping only in the afternoon and evening.
Next, a graph summarizing some information on order behaviour is displayed. From this visualization we can learn that the online shop sells products in a low price segment, but usually receives orders containing a rather high number of articles (peak at 3-12). The different history plots on the right indicate high activity in the first half of February, which drops at the beginning of the second half of the month and then slowly rises again. The high activity in the first half could possibly be explained by customers buying presents for Valentine's Day (14 February).
The map clearly shows very high browsing activity for the large US cities. The east coast, especially the area around New York, has a high share of online shop visitors.
The US states plot shows that there is high activity in California and New York. In general, there is lower activity in the states in the center.
Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details see the top attributes for products in the summary tables.
The Lorenz curve for the products is rather flat, which indicates that there are no big outliers in product popularity measured by views. Furthermore, the curve has a slight kink at about 60% of the products, which implies that about 40% of the products attract very little interest.
The following graph appears a bit erratic, indicating that manufacturers can be divided into several popularity groups. Additionally, the curve bows out considerably, which shows a highly unequal distribution of manufacturer views. This implies that the manufacturer is highly important to viewers.
In contrast, the brand Lorenz curve shows a more equally distributed popularity, because it is rather flat overall. At about 50% of the brands, a kink can be seen in the graph, which divides the brands into a rather popular half and a rather unpopular one.
Especially interesting for analysis purposes is the time component that comes with the clickstream dataset. Each click carries a timestamp and a sequence number.
The following plot shows that most of the clicks are accumulated in the morning and then decrease over the course of the day. The extreme spike at around 2 a.m. can be neglected, as it seems to be bot activity with extremely long session times and sequences.
The same behaviour is also visible in the overall click time series. Generally, there is more activity during the middle of April and less towards the end. The seasonal component of the clickstream dataset is also clearly visible in this kind of visualization.
While the mean session length is about 8 minutes, the median lies at about 2 minutes (see table). This is visible in the graph, where the peak is also at about 2 minutes, indicating the skewness of the time data. Most sessions last no more than about 10 minutes, which suggests that products should be presented prominently, since customers do not spend much time searching for them. For a better visualization we left out sessions with a duration over 40 minutes.
For some of the graphs a direct comparison makes sense in order to visualize the differences between the orders and clicks dataset.
The maps clearly show the numerical difference between the two datasets. Chicago and Dallas in particular stand out, because both cities have a very high number of viewers but a rather small customer base. One possible reason for this effect would be that the active bots originate from these two cities. The bot presence from Dallas could be explained by the high ratio of IT experts living there.
Orders Map
Clicks Map
Note: For the Lorenz curves we dropped the x-axis tick labels for better readability. For details see the top attributes for products in the summary tables.
In the following, the product data Lorenz curves for both datasets are compared. In each case the order plot is displayed on the left and the click plot on the right. For all plots it can be observed that the curve for the order data is more tilted towards the upper left corner. Thus, it can be stated that the orders show a higher inequality in the popularity of product attributes. This means that people usually seem to pick the same products or brands, despite a greater variety in the products they view.
In addition to analyzing the order and clickstream data, we analyzed the performance of different recommendation models. The performance of three different recommendation systems was measured:
The profit based and ranking based recommendation systems were evaluated using inference analysis, specifically using the computational paradigm instead of the mathematical one. One test was carried out for each of the two recommendation systems, with the null hypothesis in each case being that the system does not lead to different sales than a purely random recommendation system. During each test we randomize our sample data 1000 times, using either permutation or bootstrapping, and measure the p-value and confidence interval. The test statistic we always use is the difference in means between the group using the profit or ranking based recommendation system and the group using the random recommendation system. If the null hypothesis were true, the test statistic value for our sample would not differ significantly from the distribution of the test statistic for our randomized data. Our default alpha for the confidence interval is 5%, but since we conduct a total of two tests, we have to apply the Bonferroni correction and adjust the alpha we specify for our confidence interval to 5% / 2 = 2.5%.
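The permutation variant of this procedure can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the project's code: the function name, the fixed random seed and the sales figures are made up, and the p-value is computed as the share of permuted differences that are at least as extreme as the observed one.

```python
import random
import statistics


def permutation_p_value(treated, control, n_permutations=1000, seed=0):
    """Two-sided permutation test for the difference in means."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    observed = statistics.mean(treated) - statistics.mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling breaks any link between group label and sales,
        # which is exactly what the null hypothesis claims.
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:n_treated])
                - statistics.mean(pooled[n_treated:]))
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Made-up sales figures in EUR for two groups of customers.
with_system = [20.0, 17.75, 21.75, 19.5, 22.0, 18.5, 21.0, 20.5]
random_system = [8.5, 8.5, 11.0, 9.0, 10.0, 9.5, 10.5, 8.75]

alpha = 0.05 / 2  # Bonferroni correction for two tests
observed, p_value = permutation_p_value(with_system, random_system)
print(f"difference in means = {observed:.2f}, p-value = {p_value}")
print("significant" if p_value < alpha else "not significant")
```

With 1000 permutations the smallest non-zero p-value this sketch can report is 0.001, so a printed value of 0 only means that no permuted difference was as extreme as the observed one.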
Before diving into the inference analysis itself, we had to reformat our data for the recommendation systems into a shape that is suitable for inference analysis. We want to have a data frame in which one row equals one customer, who was exposed to a recommendation system. We use three columns:
If the value in the columns Used_Top_recommendations and Used_Profit_Oriented_recommendations is 0, it means that the random recommendation system was used.
In the following table you can see the first 10 rows of the reformatted dataset:
| Sales_in_EUR | Used_Profit_Oriented_recommendations | Used_Top_recommendations |
|---|---|---|
| 8.50 | 0 | 0 |
| 20.00 | 1 | 0 |
| 16.00 | 0 | 1 |
| 8.50 | 0 | 0 |
| 17.75 | 1 | 0 |
| 19.75 | 0 | 1 |
| 18.75 | 0 | 0 |
| 21.75 | 1 | 0 |
| 19.75 | 0 | 1 |
| 17.75 | 0 | 0 |
We executed an inference analysis for the profit oriented recommendation system. Firstly, we look at the p-value and the corresponding plot:
## [1] "p-value = 0"
The plot shows us the distribution of the test statistic for the 1000 randomized samples. The test statistic value for our sample is represented by a black line. The two-sided p-value regions are marked by a grey background. If our null hypothesis were true, the test statistic value of our sample would lie somewhere within the distribution of the test statistic for the randomized samples. Every test statistic value of a randomized sample that lies in the p-value region increases the p-value.
As we can see, the test statistic value of our sample is far away from the test statistic values of the randomized samples. This already shows, without looking at the p-value itself, that the profit oriented recommendation system causes a significant difference in the sales in Euro. The reported p-value of 0 means that none of the 1000 randomized samples produced a difference at least as extreme as the observed one, which reaffirms our interpretation of the plot.
Next, we look at the confidence interval:
| 2.5% | 97.5% |
|---|---|
| 3.320897 | 4.409748 |
As we can see, we can be 95% confident that, if the shop starts using the profit oriented recommendation system for all customers, they would spend on average 3.32€ to 4.41€ more than with the random recommendation system.
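A percentile bootstrap is one common way to obtain such an interval for a difference in means. The following sketch assumes that approach; the helper name `bootstrap_ci`, the seed and the sales figures are hypothetical, and the confidence level is a parameter so that a Bonferroni-adjusted level could be passed instead.

```python
import random
from statistics import mean


def bootstrap_ci(treatment, control, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the difference in means (hypothetical helper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the group sizes fixed.
        resampled_treatment = rng.choices(treatment, k=len(treatment))
        resampled_control = rng.choices(control, k=len(control))
        diffs.append(mean(resampled_treatment) - mean(resampled_control))
    diffs.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return diffs[lower_index], diffs[upper_index]


# Made-up sales in EUR, for illustration only.
treated = [20.0, 17.75, 21.75, 19.5, 22.0, 18.25, 21.0, 20.5]
random_group = [8.5, 8.5, 18.75, 17.75, 12.0, 9.5, 14.0, 11.25]

lower, upper = bootstrap_ci(treated, random_group)
```

If the resulting interval does not contain 0, this agrees with rejecting the null hypothesis at the corresponding alpha.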
The next step was to perform an inference analysis for the ranking based recommendation system. Firstly, we look at the p-value and the corresponding plot:
## [1] "p-value = 0.086"
In this plot some of the test statistic values for our randomized samples lie in the p-value region. This is reflected in the p-value of 0.086, which is greater than our Bonferroni-adjusted alpha of 0.025. This means that at our overall alpha of 0.05 the effect of the ranking based recommendation system is statistically insignificant.
Let us look at the confidence interval:
| 2.5% | 97.5% |
|---|---|
| -0.0073997 | 1.28289 |
Since the confidence interval includes the value 0, it shows us that the effect is statistically insignificant.
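The two decision criteria used in this chapter, the Bonferroni-adjusted p-value threshold and the check whether the confidence interval contains 0, can be written down as a small worked example. The numbers are the ones reported above; the function name `is_significant` and the dictionary layout are only illustrative.

```python
alpha_family = 0.05
n_tests = 2
alpha_per_test = alpha_family / n_tests  # Bonferroni correction: 0.025

# p-values and confidence intervals as reported in this chapter.
results = {
    "profit oriented": {"p_value": 0.0, "ci": (3.320897, 4.409748)},
    "ranking based": {"p_value": 0.086, "ci": (-0.0073997, 1.28289)},
}


def is_significant(p_value, ci, alpha):
    """Significant only if p < alpha and the interval excludes 0."""
    lower, upper = ci
    ci_excludes_zero = lower > 0 or upper < 0
    return p_value < alpha and ci_excludes_zero


decisions = {
    name: is_significant(result["p_value"], result["ci"], alpha_per_test)
    for name, result in results.items()
}
```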
To sum up, the company should use the profit oriented recommendation system, since out of the two tested systems it causes the largest increase in revenue. The ranking based recommendation system does not cause any statistically significant difference in sales. However, if the measured outcome were something other than sales in Euro per person, the results could be different. The company should ask itself whether increasing revenue should really be its only goal for using recommendation systems. Maybe at some point the organization could introduce a subscription business model, similar to that of Amazon. In that case it might also be important to increase the percentage of customers that have a subscription.
Following the inference analysis, our next task was to predict whether a person browsing the online shop will end up purchasing something, based on their browsing behaviour on the site. Since the vast majority of click sessions do not contain any customer information, due to unregistered users browsing the website or registered users not being logged in, we decided to omit that type of data for our prototypes. However, it is technically possible to include these attributes, which should be considered when developing a more advanced version of our prediction models.
First of all, the relevant data had to be extracted or engineered from the clickstream data and was used to build a separate dataset, which represents the input data for a prediction model. The resulting data set contains one row per session. For each session the following attributes are used:
| Clicks | Duration_in_Seconds | REQUEST_DAY_OF_WEEK | REQUEST_HOUR_OF_DAY | Ordered |
|---|---|---|---|---|
| 14 | 2028 | Friday | 23 | No |
| 8 | 272 | Friday | 23 | No |
| 2 | 39 | Friday | 23 | No |
| 16 | 91 | Friday | 23 | No |
| 15 | 79 | Saturday | 0 | No |
| 14 | 67 | Saturday | 0 | No |
| 16 | 68 | Saturday | 0 | No |
| 3 | 79 | Saturday | 0 | No |
| 7 | 189 | Saturday | 0 | No |
| 3 | 127 | Saturday | 0 | No |
However, an important thing to note is that about 99% of sessions do not contain an order. In addition to that, sessions consisting of only one click are ignored, since they offer no valuable information, because attributes like the session duration cannot be calculated.
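The step from raw clicks to one row per session can be sketched as follows. The input layout (session id, request time, order flag) is a hypothetical simplification of the clickstream data, and taking the weekday and hour from the first click of a session is an assumption; only the output column names come from the table above.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical click records: (session_id, request_time, click_completed_an_order)
clicks = [
    ("s1", "2000-04-14 23:01:00", False),
    ("s1", "2000-04-14 23:03:10", False),
    ("s1", "2000-04-14 23:05:32", True),
    ("s2", "2000-04-15 00:10:00", False),
    ("s3", "2000-04-15 00:20:00", False),
    ("s3", "2000-04-15 00:21:19", False),
]


def build_sessions(click_rows):
    """Aggregate clicks into one feature row per session (hypothetical helper)."""
    grouped = defaultdict(list)
    for session_id, request_time, is_order in click_rows:
        parsed = datetime.strptime(request_time, "%Y-%m-%d %H:%M:%S")
        grouped[session_id].append((parsed, is_order))

    sessions = []
    for events in grouped.values():
        if len(events) < 2:
            continue  # single-click sessions have no measurable duration
        times = [time for time, _ in events]
        start, end = min(times), max(times)
        sessions.append({
            "Clicks": len(events),
            "Duration_in_Seconds": int((end - start).total_seconds()),
            "REQUEST_DAY_OF_WEEK": DAY_NAMES[start.weekday()],
            "REQUEST_HOUR_OF_DAY": start.hour,
            "Ordered": "Yes" if any(flag for _, flag in events) else "No",
        })
    return sessions


sessions = build_sessions(clicks)
```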
To handle model variance and bias we use a grid search approach. This allows us to find the best bias-variance trade-off for each model without having to inspect the source of variance or bias and fine-tune model parameters manually. This way we do not need bias or variance related plots, such as learning curves.
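As an illustration of the grid search idea, the sketch below tunes a decision tree with scikit-learn's `GridSearchCV`. It is not the configuration used for the report: the synthetic data, the parameter grid, the five-fold cross-validation and the choice of `roc_auc` as scoring string are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the session data (clicks and duration only).
rng = np.random.default_rng(0)
n_sessions = 300
clicks = rng.integers(2, 30, size=n_sessions)
duration = rng.integers(10, 2000, size=n_sessions)
X = np.column_stack([clicks, duration])
# Artificial label, loosely tied to both features, for illustration only.
y = ((clicks > 10) & (duration > 290)).astype(int)

# Hypothetical grid: deeper trees reduce bias, larger leaves reduce variance.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Each parameter combination is scored by cross-validation, so the selected combination is the one that generalizes best on held-out folds rather than the one that fits the training data most closely.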
While our approach optimises the bias-variance trade-off for each model, it does not account for bias and variance differences between several models. The two types of models we use for our problem are decision trees and random forests. By design, random forests have less bias and more variance than decision trees. This means that, depending on the performance of both models, we can see whether our main problem is too much bias or too much variance in the data. For example, if random forests were to perform better than decision trees, it would mean that our main problem is too much bias in the data.
It should also be mentioned that, depending on the business goal, the ratio of positive and negative instances in the training set should be adjusted. We use a 1:2 ratio of positives to negatives. Increasing the number of negatives would give a model a higher rate of true negatives and fewer false positives. However, it would also increase the rate of false negatives. Increasing the number of positive instances in the training set would have the reverse effect.
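One simple way to obtain such a ratio is to keep all positive sessions and draw a random sample of negatives. The sketch below assumes this undersampling approach; the function name `undersample`, the seed and the toy sessions are hypothetical.

```python
import random


def undersample(sessions, negatives_per_positive=2, seed=0):
    """Keep all positives and sample negatives to reach the requested ratio."""
    rng = random.Random(seed)
    positives = [s for s in sessions if s["Ordered"] == "Yes"]
    negatives = [s for s in sessions if s["Ordered"] == "No"]
    n_negatives = min(len(negatives), negatives_per_positive * len(positives))
    sample = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(sample)  # avoid an ordering by class in the training set
    return sample


# Toy data with roughly 1% positives, mirroring the imbalance described above.
sessions = [{"id": i, "Ordered": "Yes" if i % 100 == 0 else "No"} for i in range(1000)]
training_set = undersample(sessions)
```

Raising `negatives_per_positive` moves the training set closer to the real class distribution, which corresponds to the trade-off between false positives and false negatives described above.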
The first model of choice is a decision tree, because it makes the decision-making process of the model easy to understand. Since it removes the need for dimensionality reduction and feature scaling, it also reduces the required amount of work. The evaluation metric used during the grid search is the ROC score. However, depending on the evolving needs of the company, one might want to choose a different evaluation metric.
Since the training set is created by using random samples, model performance can vary during each run. However, on average the following could be observed:
Also, the most important features are usually the number of clicks and the session duration.
The confusion matrix can be seen below. As you can see, the model correctly classifies around 95% of the sessions that contain an order. However, only around 86% of the sessions that do not contain an order are classified correctly. This brings up the following question: How should each classification type be weighted? To answer this, the company should first ask itself: “How, if at all, do we want to target session users differently?” For example, one idea could be to offer discounts to visitors who have a high click rate during short sessions, as this could imply dissatisfaction with the price or quality of the visited products.
Decision Tree - Confusion Matrix
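The two percentages quoted for the confusion matrix are the share of order sessions classified correctly (recall) and the share of non-order sessions classified correctly (specificity). The sketch below shows how both follow from the four cells of a confusion matrix; the label vectors are made up so that they reproduce rates of the same size and are not the report's data.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion matrix cells for binary labels (1 = order)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, predicted in zip(y_true, y_pred):
        if truth == 1 and predicted == 1:
            counts["TP"] += 1
        elif truth == 0 and predicted == 1:
            counts["FP"] += 1
        elif truth == 1 and predicted == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Made-up labels: 20 sessions with an order, 100 without.
y_true = [1] * 20 + [0] * 100
y_pred = [1] * 19 + [0] * 1 + [0] * 86 + [1] * 14

counts = confusion_counts(y_true, y_pred)
recall = counts["TP"] / (counts["TP"] + counts["FN"])       # order sessions found
specificity = counts["TN"] / (counts["TN"] + counts["FP"])  # non-order sessions found
```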
As mentioned above, the main reasons for choosing a decision tree are the ease of understanding its logic and the automatic feature filtering. Below you can see a visualization of the decision tree. It has four layers. The only variables of interest are the session duration and the number of clicks during the session. For example, in this case whenever a session has more than 10.5 clicks and lasts longer than 289.5 seconds, the tree will predict that the session contains an order. On the other hand, if a session has 8.5 clicks or fewer, the tree will predict that it does not contain an order.
Decision Tree Visualized
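The two rules quoted from the tree translate directly into code, which shows how transparent the model is. The function below encodes only those two rules; for sessions that fall between them the text does not state the outcome, so the sketch returns `None` there instead of guessing. The function name is hypothetical.

```python
def predict_order(clicks, duration_in_seconds):
    """Apply the two tree rules quoted in the text; None where no rule is stated."""
    if clicks <= 8.5:
        return "No"
    if clicks > 10.5 and duration_in_seconds > 289.5:
        return "Yes"
    return None  # remaining branches of the tree are not described in the text


long_busy_session = predict_order(12, 300)
short_session = predict_order(8, 272)
undescribed_case = predict_order(16, 91)
```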
The second model of choice is a random forest. A random forest is an ensemble model made up of several decision trees. Due to its structure, a random forest has less bias than a single decision tree, but it suffers from higher variance. It also offers less insight into the logic of the model, since it is made up of several trees. Depending on the complexity of the forest, however, it is still possible to present an overview of that logic. For example, if the forest is made up of only 5 trees, each tree could be visualized. The more trees a forest consists of, the less transparency it offers.
Again, since the training set is created by using random samples, model performance can vary during each run. The following values could be observed on average:
This shows a 1%-3% drop in performance compared to the previous decision tree model. This indicates a high variance in the data, which favors a decision tree over a random forest.
The confusion matrix can be seen below. The model correctly classifies around 98% of the sessions that contain an order. On the other hand, it correctly classifies only around 85% of the sessions that do not contain an order, which means that it wrongly predicts an order for the remaining sessions of that kind.
In this case it looks like the model simply trades a lower ratio of false negatives for a higher ratio of false positives, compared to the decision tree. But on average the combined ratio of false positives and false negatives is slightly higher than for the decision tree model.
Random forest - Confusion Matrix
First of all, it should be evaluated which correct and false predictions are important. For example, the company might want to offer discounts to potential customers who would normally not order something during their session. In that case the ratio of false negatives should be kept down, as the company does not want to offer discounts to too many people who would order a product anyway.
After this evaluation a fitting performance metric can be chosen. Depending on the metric, the performance ranking of different models can change. In addition, the ratio of positive and negative instances used to train the model should be adjusted, as described in this chapter.
Another important question is how relevant model transparency is. If it is not required at all to understand how and why the model predicts, more models and modelling strategies should be considered. For example, neural networks would become an option, and methods for automatic dimensionality reduction, such as PCA, or ensemble techniques, such as bagging or boosting, should also be evaluated.
In addition, it should be considered whether continuous training of the model is a good idea. On the one hand, it offers the advantage of staying up to date by feeding the newest customer behaviour data to the model. On the other hand, it would require a higher investment in monitoring software and personnel as well as IT security. The reason is that continuous learning creates the risk of model corruption, which requires constant performance monitoring, regular model backups and a quick way to switch to an older version of the model.
Finally, we would like to summarise our most important findings and point out some recommendations for the shop.
The analysis of the given datasets, in the shape of summary tables as well as visualizations, points out some interesting aspects for business decisions of the company providing the online shop. For all of these results, however, it has to be kept in mind that the compared datasets refer to different time periods, and therefore the results might not be fully comparable. To address the most important discoveries: It could be observed that younger customer groups tend to browse more and buy less than the older ones. This might indicate a possible application area for special discounts on products addressed to a rather young customer base. The audience analysis results can be regarded as another important finding, because the comparison of both datasets showed that products for men are bought with comparably less browsing activity. This points towards a more goal-oriented customer base, which reduces the relevance of discounts on these products. Furthermore, it could be observed that significantly more seasonal products are viewed than bought. This probably hints at customer dissatisfaction with seasonal products. Hence, revising the price model for these product groups should be considered. From the Lorenz curve of the order manufacturers it can be seen that the top-ranked quarter of manufacturers is highly effective. Thus, the shop could think about reducing the number of products from unpopular manufacturers.
The company should use the profit oriented recommendation system for all customers to increase the revenue. At the same time it should be evaluated whether revenue is the only important metric for recommendation system performance. Other metrics could be the rate of non-customer to customer conversion or the customer subscription rate.
Furthermore, the application of a predictive model requires a cost-benefit matrix to be constructed, which in turn requires developing a strategy for dealing with buyers and non-buyers browsing the shop. One possible strategy is to offer discounts to browsing visitors who are not likely to order a product during their browsing session. For that example the predictive models we developed already offer great performance, since over 99.95% of predictions that a visitor will not buy something are correct. It should be noted, however, that it is hard to estimate how many of these people could be tilted towards buying a product after being offered a discount.
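A cost-benefit matrix assigns a monetary value to each of the four prediction outcomes and weights it by how often that outcome occurs. The sketch below illustrates the calculation for the discount strategy described above. Every number in it is hypothetical: the outcome shares, the margin of 5 EUR, the discount of 2 EUR and in particular the 5% conversion rate after a discount, which the text notes is hard to estimate.

```python
# Hypothetical share of all sessions falling into each outcome (sums to 1).
outcome_share = {"TP": 0.0095, "FN": 0.0005, "FP": 0.14, "TN": 0.85}

margin = 5.0           # hypothetical profit per order in EUR
discount = 2.0         # hypothetical discount offered to predicted non-buyers
conversion_rate = 0.05 # hypothetical share of non-buyers won over by a discount

# Value of one session per outcome under the discount strategy.
outcome_value = {
    "TP": margin,                                # predicted buyer, buys at full price
    "FP": 0.0,                                   # predicted buyer, no discount, no order
    "FN": margin - discount,                     # would have bought anyway, gets discount
    "TN": conversion_rate * (margin - discount), # non-buyer, sometimes converted
}


def expected_value(shares, values):
    """Expected profit per session: sum of share times value over all outcomes."""
    return sum(shares[outcome] * values[outcome] for outcome in shares)


expected_profit_per_session = expected_value(outcome_share, outcome_value)
```

Comparing this figure for different models or training ratios turns the question of how to weight false positives against false negatives into a single number.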
During our analysis of the data we discovered traces pointing towards automated bots being active on the online shop. This is problematic, since it distorts the results of our analysis and the usefulness of input data for our predictive models. The company should consider investing some resources into spotting bot activity and its sources, to exclude data from bot activity from future analysis results as well as from input data for predictive models.
To get an overview of the content of the “0 Data” folder:
The clickstream dataset contains a high quantity of information. Thus, not all attributes are displayed in the summary tables above, but can be seen in the following overview tables.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| WhichDoYouWearMostFrequent | casual socks: 34.62% | hosiery: 31.14% | athletic socks: 19.15% | trouser socks: 15.09% | | 0% | 98.98% |
| YourFavoriteLegcareBrand | Conair: 15.4% | Nature Made: 12.79% | Epilady: 10.7% | eShave: 7.31% | Lucky Chick: 6.53% | 47.27% | 99.24% |
| Registration Gender | Female: 52.17% | Male: 47.83% | | | | 0% | 99.95% |
| NumberOfChildren | 0: 60.87% | 1: 13.04% | 2: 13.04% | 4 or more: 13.04% | | 0.01% | 99.95% |
| DoYouPurchaseForOthers | False: 100% | | | | | 0% | 96.96% |
| HowDoYouDressForWork | business casual: 41.57% | very casual: 29.22% | business dress: 15.29% | comfortable / athletic: 13.92% | | 0% | 98.99% |
| HowManyPairsDoYouPurchase | 15 or more: 48.9% | 11 to 15: 30.94% | 1 to 5: 14.37% | 6 to 10: 5.79% | | 0% | 99.01% |
| YourFavoriteLegwearBrand | Hanes: 39.14% | DKNY: 17.22% | Donna Karan: 8.61% | Danskin: 8.22% | Berkshire: 5.09% | 21.72% | 98.99% |
| WhoMakesPurchasesForYou | spouse: 50.51% | friend: 24.24% | parent: 23.23% | siblings: 2.02% | | 0% | 99.8% |
| NumberOfAdults | 2: 43.48% | 3 or more: 34.78% | 1: 21.74% | | | 0% | 99.95% |
| HowDidYouHearAboutUs | other: 38.32% | friend / family: 31.99% | e-mail: 18.15% | print ad: 9.23% | direct mail: 1.19% | 1.12% | 97.34% |
| SendEmail | True: 65.92% | False: 34.08% | | | | 0% | 96.92% |
| HowOftenDoYouPurchase | every 6 months: 75.84% | once a year: 16.62% | each week: 7.53% | | | 0.01% | 99.24% |
| HowDidYouFindUs | Friend/Co-worker: 69.57% | Other: 26.09% | News Story: 4.35% | | | 0% | 99.95% |
| City | San Francisco: 2.31% | New York: 2.18% | Chicago: 1.35% | Stamford: 1.09% | Dallas: 0.71% | 92.36% | 96.92% |
| US State | CA: 13.28% | NY: 11.1% | TX: 5.97% | PA: 5.39% | IL: 4.68% | 59.58% | 96.92% |
| | COM: 72.1% | NET: 19.69% | Gazelle: 2.95% | EDU: 2.57% | Other: 2.12% | 0.57% | 96.92% |
| Truck Owner | False: 77.84% | True: 22.16% | | | | 0% | 97.72% |
| RV Owner | False: 91.17% | True: 8.83% | | | | 0% | 97.72% |
| Motorcycle Owner | False: 98.79% | True: 1.21% | | | | 0% | 97.72% |
| Marital Status | Married: 61.8% | Single: 24.7% | Inferred Married: 7% | Inferred Single: 6.5% | | 0% | 98.02% |
| Working Woman | False: 65.54% | True: 34.46% | | | | 0% | 97.72% |
| Mail Responder | True: 76.54% | False: 23.46% | | | | 0% | 97.72% |
| Bank Card Holder | True: 83.55% | False: 16.45% | | | | 0% | 97.72% |
| Gas Card Holder | True: 72.73% | False: 27.27% | | | | 0% | 97.72% |
| Upscale Card Holder | True: 51.08% | False: 48.92% | | | | 0% | 97.72% |
| Unknown Card Type | False: 60.17% | True: 39.83% | | | | 0% | 97.72% |
| TE Card Holder | False: 91% | True: 9% | | | | 0% | 97.72% |
| Premium Card Holder | False: 78.96% | True: 21.04% | | | | 0% | 97.72% |
| Presence Of Children | False: 50.39% | True: 49.61% | | | | 0% | 97.72% |
| Estimated Income Code | $50;000-$74;999: 23.17% | $75;000-$99;999: 17.74% | $40;000-$49;999: 11.41% | $30;000-$39;999: 11.23% | $125;000 OR MORE: 10.34% | 26.11% | 97.78% |
| Home Market Value | $75;000-$99;999: 16.61% | $50;000-$74;999: 15.09% | $100;000-$124;999: 13.68% | $125;000-$149;999: 9.12% | $150;000-$174;999: 7.49% | 38.01% | 98.31% |
| New Car Buyer | True: 100% | | | | | 0% | 98.97% |
| Vehicle Lifestyle | IMPORT (STANDARD/ECONOMY): 27.76% | FULL SIZE (STANDARD/LUXURY): 22.64% | TRUCK OR UTILITY VEHICLE: 12.6% | SPECIALTY (MIDSIZE/SMALL): 11.42% | STATION WAGON: 10.83% | 14.75% | 99% |
| Property Type | single family dwelling: 86.59% | condo: 7.53% | 2-4 unit(duplex;triplex;quad): 2.35% | misc. residential (condo store/flat): 1.88% | apartment(5+ units): 0.94% | 0.71% | 99.16% |
| Loan To Value Percent | 0% (NO LOANS): 30.26% | 100-99%: 10.53% | 70-74%: 8.88% | 75-79%: 8.88% | 80-84%: 8.88% | 32.57% | 99.4% |
| Presence Of Pool | False: 98.87% | True: 1.13% | | | | 0% | 97.72% |
| Own Or Rent Home | Owner: 93.56% | Renter: 6.44% | | | | 0% | 97.97% |
| Mail Order Buyer | True: 64.68% | False: 35.32% | | | | 0% | 97.72% |
| DMA No Mail Solicitation Flag | True: 100% | | | | | 0% | 97.72% |
| DMA No Phone Solicitation Flag | True: 100% | | | | | 0% | 97.72% |
| New Bank Card | False: 100% | | | | | 0% | 97.72% |
| Speciality Store Retail | False: 84.59% | True: 15.41% | | | | 0% | 97.72% |
| Oil Retail Activity | False: 90.65% | True: 9.35% | | | | 0% | 97.72% |
| Bank Retail Activity | False: 77.58% | True: 22.42% | | | | 0% | 97.72% |
| Finance Retail Activity | False: 90.3% | True: 9.7% | | | | 0% | 97.72% |
| Miscellaneous Retail Activity | False: 94.37% | True: 5.63% | | | | 0% | 97.72% |
| Upscale Retail | False: 94.37% | True: 5.63% | | | | 0% | 97.72% |
| Upscale Speciality Retail | False: 96.36% | True: 3.64% | | | | 0% | 97.72% |
| Retail Activity | False: 64.33% | True: 35.67% | | | | 0% | 97.72% |
| Dwelling Size | SINGLE HOUSEHOLD: 75.58% | 2 HOUSEHOLDS: 6.78% | 100+ HOUSEHOLDS: 3.53% | 3 HOUSEHOLDS: 2.41% | 10-19 HOUSEHOLDS: 1.95% | 9.75% | 97.87% |
| Lendable Home Equity | EQUITY LESS THAN OR EQUAL $0: 33.22% | EQUITY $10;000-$19;9999: 9.54% | EQUITY $1-$4;999: 8.55% | EQUITY $75;000-$99;999: 8.55% | EQUITY $100;000-$149;999: 8.22% | 31.92% | 99.4% |
| Home Size Range | 1;250-1;499 FT: 16.04% | 2;000-2;499 FT: 15.41% | 1;000-1;249 FT: 14.47% | 1;500-1;749 FT: 13.84% | 1;750-1;999 FT: 10.69% | 29.55% | 99.37% |
| Lot Size Range | 1 ACRE OR LESS: 89.58% | GREATER THAN 1 ACRE: 10.42% | | | | 0% | 99.43% |
| Dwelling Unit Size | SINGLE FAMILY DWELLING UNIT: 74.03% | MULTI FAMILY DWELLING UNIT: 25.97% | | | | 0% | 97.81% |
| Available Home Equity | EQUITY $30;000-$49;000: 19.67% | EQUITY $50;000-$74;000: 18.56% | EQUITY $75;000-$99;999: 11.6% | EQUITY $20;000-$29;000: 11.27% | EQUITY $100;000-$149;999: 11.05% | 27.85% | 98.21% |
| Minority Census Tract | False: 98.61% | True: 1.39% | | | | 0% | 97.72% |
| Gender | Female: 83.33% | Male: 16.67% | | | | 0% | 98.43% |
| Occupation | PROFESSIONAL/TECHNICAL: 32.33% | HOUSEWIFE: 15.54% | ADMINISTRATIVE/MANAGERIAL: 14.29% | CLERICAL/WHITE COLLAR: 11.28% | STUDENT: 6.02% | 20.54% | 99.21% |
| Other Indiv Gender | Male: 81.91% | Female: 18.09% | | | | 0% | 98.84% |
| Other Indiv Occupation | PROFESSIONAL/TECHNICAL: 45.9% | ADMINISTRATIVE/MANAGERIAL: 17.93% | CRAFTSMAN/BLUE COLLAR: 13.68% | SALES/SERVICE: 6.38% | CLERICAL/WHITE COLLAR: 4.86% | 11.25% | 99.35% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Year of Birth | 1979 | 1966.58 | 1966 | 1948 | 7.95 |
| Value Of All Vehicles | 99000 | 19111.56 | 16000 | 1000 | 14190.98 |
| Age | 86 | 37.58 | 36 | 18 | 10.71 |
| Other Indiv Age | 86 | 40.90 | 38 | 18 | 11.51 |
| Number Of Adults | 6 | 2.54 | 2 | 1 | 1.35 |
| Year House Was Built | 1997 | 1963.50 | 1968 | 1850 | 27.02 |
| Length Of Residence | 15 | 6.77 | 6 | 0 | 4.39 |
| Year Home Was Bought | 1999 | 1991.18 | 1993 | 1954 | 6.34 |
| Home Purchase Date | 199906 | 199121.06 | 199300 | 195400 | 633.82 |
| Number Of Vehicles | 3 | 1.47 | 1 | 1 | 0.61 |
| CRA Income Classification | 4 | 3.31 | 3 | 1 | 0.63 |
| Number Of Credit Lines | 9 | 2.72 | 3 | 1 | 1.61 |
| Dataquick Market Code | 10 | 4.61 | 4 | 1 | 2.53 |
| Insurance Expiry Month | 12 | 6.54 | 6 | 1 | 3.45 |
| Month Home Was Bought | 12 | 7.03 | 7 | 1 | 3.42 |
| Year Of Structure | 1999 | 1973.29 | 1980 | 1900 | 26.47 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| BrandName | DKNY: 9.99% | Silk Reflections: 9.21% | ORO: 8.85% | HPK: 7.22% | AME: 7.13% | 57.6% | 86.16% |
| PrimaryPackage | Bottle: 35.18% | Tube: 30.46% | Jar: 25.55% | Box: 5.4% | Spray: 3.41% | 0% | 96.07% |
| StockType | Replenishable: 60.93% | Seasonal 1: 23.75% | Seasonal 1*: 11.25% | Seasonal 2: 2.07% | Replenishment: 1.83% | 0.17% | 80.3% |
| ProductForm | Cream: 54.01% | Liquid: 24.19% | gel: 7.58% | Lotion: 7.36% | Capsule: 5.97% | 0.89% | 96.64% |
| Look | Sheer: 83.41% | Ultra Sheer: 13.1% | Opaque: 3.48% | | | 0.01% | 94.22% |
| BasicOrFashion | Basic: 92.33% | Fashion: 7.67% | | | | 0% | 86.11% |
| MfgStyleCode | Tricot: 2.34% | BC27340: 1.45% | 00N02: 1.35% | 00Q63: 1.34% | 5751: 1.2% | 92.32% | 82.06% |
| SaleOrNonSale | NSALE: 100% | | | | | 0% | 94.46% |
| HasDressingRoom | False: 73.56% | True: 26.44% | | | | 0% | 86.09% |
| ColorOrScent | Scent: 85.69% | Color: 14.31% | | | | 0% | 99.7% |
| Texture | Flat: 66.48% | Textured: 33.52% | | | | 0% | 96.65% |
| Manufacturer | Donna Karan Company: 10.79% | Peneco: 9.08% | HAN: 8.66% | Kneipp: 6.61% | Paul Lavitt Mills Inc.: 6.54% | 58.32% | 80.12% |
| ToeFeature | SF: 86.04% | RT: 13.96% | | | | 0% | 93.49% |
| Category2 | Gift Sets & Special Items: 32.01% | Skincare: 23.28% | Cellulite & Other Treatments: 22.82% | Footcare: 14.47% | Health Supplements: 6.2% | 1.22% | 99.21% |
| Material | Cotton: 66.35% | Nylon: 23.62% | Coolmax: 3.71% | Rayon: 1.49% | Lycra: 1.04% | 3.79% | 93.18% |
| CategoryCode | PH: 33.89% | WDCS: 12.66% | TH: 7.83% | FO: 6.76% | TT: 6.2% | 32.66% | 86.09% |
| WaistControl | CT: 76.17% | STW: 23.83% | | | | 0% | 94.34% |
| Collection | Oroblu Italian Hosiery: 6.31% | Conversationals: 5.17% | DKNY Skin: 4.41% | Action Pack 3-Pair: 3.9% | Womens Dance: 3.84% | 76.37% | 86.96% |
| BodyFeature | MBC: 64.5% | UBC: 17.51% | LBC: 11.17% | BS: 6.82% | | 0% | 98.49% |
| Audience | Women: 80.86% | Men: 10.21% | Children: 8.93% | | | 0% | 86.09% |
| Category1 | Skincare: 60.62% | Footcare: 18.05% | Cellulite & Other Treatments: 15% | Hair Removal: 3.89% | Health Supplements: 1.83% | 0.61% | 94% |
| Product | Cellulite Trimming Gel: 3.25% | Body Lotion - Oceanic Minerals: 2.74% | Kit-Firming Cream/Slimming Cream/Shorts: 2.57% | Body Silk: 2.41% | Herbal Foot Balm: 2.31% | 86.72% | 94% |
| Pattern | Solid: 58% | Conversational: 39.12% | Floral: 2.03% | Stripe: 0.55% | Herringbone: 0.18% | 0.12% | 93.72% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| UnitsPerInnerBox | 12.0 | 4.44 | 3.00 | 1.00 | 2.91 |
| Depth | 16.0 | 2.73 | 2.50 | 0.50 | 2.15 |
| VendorMinREOrderDollars | 500.0 | 161.15 | 150.00 | 100.00 | 82.29 |
| Height | 8.5 | 1.18 | 0.75 | 0.25 | 1.20 |
| UnitsPerOuterBox | 144.0 | 18.57 | 12.00 | 4.00 | 16.83 |
| Pack | 3.0 | 1.14 | 1.00 | 1.00 | 0.50 |
| Length | 16.5 | 9.12 | 9.25 | 3.50 | 1.65 |
| MinQty | 144.0 | 16.48 | 6.00 | 0.00 | 30.45 |
| LeadTime | 28.0 | 10.95 | 10.00 | 1.00 | 6.41 |
| Weight | 40.0 | 4.98 | 2.60 | 0.40 | 5.59 |
| Width | 18.0 | 5.68 | 6.25 | 0.50 | 1.93 |
| UnitIncrement | 36.0 | 4.81 | 3.00 | 1.00 | 4.38 |